Abstract

We consider the problem of $n$-class classification ($n \ge 2$), where the classifier can choose to abstain from making predictions at a given cost, say, a factor $\alpha$ of the cost of misclassification. Designing consistent algorithms for such $n$-class classification problems with a 'reject option' is the main goal of this paper, thereby extending and generalizing previously known results for $n = 2$. We show that the Crammer-Singer surrogate and the one vs all hinge loss, albeit with a different predictor than the standard argmax, yield consistent algorithms for this problem when $\alpha = \frac{1}{2}$. More interestingly, we design a new convex surrogate that is also consistent for this problem when $\alpha = \frac{1}{2}$ and operates on a much lower dimensional space ($\log(n)$ as opposed to $n$). We also generalize all three surrogates to be consistent for any $\alpha \in [0, \frac{1}{2}]$.


1. Introduction 

In classification problems, one often encounters cases where it would be better for the classifier to take no decision and abstain from predicting rather than making a wrong prediction. For example, in the problem of medical diagnosis with inexpensive tests as features, a conclusive decision is good, but in the face of uncertainty it is better to not make a prediction and go for costlier tests.


For the case of binary actions, this problem has been called 'classification with a reject option' (Bartlett & Wegkamp, 2008; Yuan & Wegkamp, 2010; Grandvalet et al., 2008; Fumera & Roli, 2002; 2004; Fumera et al., 2000; 2003; Golfarelli et al., 1997). Yuan and Wegkamp (2010) show that many standard convex optimization based procedures for binary classification like logistic regression, least squares classification and exponential loss minimization (Adaboost) yield consistent algorithms for this problem. But as Bartlett and Wegkamp (2008) show, the algorithm based on minimizing the hinge loss (SVM) requires a modification to be consistent. The suggested modification is rather simple: use a double hinge loss with three linear segments instead of the two segments in the standard hinge loss, where the ratio of slopes of the two non-flat segments depends on the cost of abstaining $\alpha$.


In the case of multiclass classification, however, there exist no such results and it is not straightforward to generalize the double hinge loss to this setting. To the best of our knowledge, there has been only empirical and heuristic work on the multiclass version of this problem (Zou et al., 2011; Simeone et al., 2012; Wu et al., 2007). In this paper, we give a formal treatment of the multiclass problem with a 'reject' option and provide consistent algorithms for this problem.


The reject option is accommodated into the problem of $n$-class classification through the evaluation metric. We now seek a function $h : \mathcal{X} \to \{1, 2, \ldots, n, n+1\}$, where $\mathcal{X}$ is the instance space, the $n$ classes are denoted by $\{1, 2, \ldots, n\} = [n]$, and $n + 1$ denotes the action of abstaining or the 'reject' option. The loss incurred by such a function on an example $(x, y)$ with $h(x) = t$ is given by

$$\ell^\alpha(y, t) = \begin{cases} 1 & \text{if } t \ne y \text{ and } t \ne n+1 \\ \alpha & \text{if } t = n+1 \\ 0 & \text{if } t = y \end{cases} \qquad (1)$$

where $\alpha \in [0, 1]$ denotes the cost of abstaining. We will call this loss the abstain($\alpha$) loss.

It can be easily shown that the Bayes optimal risk for the above loss is attained by the function $h^*_\alpha : \mathcal{X} \to \{1, \ldots, n+1\}$ given by

$$h^*_\alpha(x) = \begin{cases} \operatorname{argmax}_{y \in [n]} p_x(y) & \text{if } \max_{y \in [n]} p_x(y) \ge 1 - \alpha \\ n + 1 & \text{otherwise} \end{cases} \qquad (2)$$

where $p_x(y) = P(Y = y \mid X = x)$. The above can be seen as a natural extension of 'Chow's rule' (Chow, 1970) for the binary case. It can also be seen that the interesting range of values for $\alpha$ is $[0, \frac{n-1}{n}]$, as for all $\alpha > \frac{n-1}{n}$ the Bayes optimal classifier for the abstain($\alpha$) loss never abstains: the largest conditional probability is always at least $\frac{1}{n}$. For example, in binary classification, only $\alpha \le \frac{1}{2}$ is meaningful, as higher values of $\alpha$ imply it is never optimal to abstain.
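As a concrete illustration of Equations (1) and (2), the following minimal Python sketch evaluates the abstain($\alpha$) loss and the Bayes optimal prediction for a given vector of conditional probabilities. The function names are hypothetical and not taken from the paper, classes are numbered $1, \ldots, n$ with $n + 1$ meaning "abstain", and ties at the threshold are resolved in favor of predicting (an assumption that does not change the risk).

```python
def abstain_loss(y, t, n, alpha):
    """abstain(alpha) loss of predicting t in {1..n+1} when the true class is y in {1..n}."""
    if t == n + 1:          # abstaining always costs alpha
        return alpha
    return 0.0 if t == y else 1.0


def expected_loss(p, t, alpha):
    """Expected abstain(alpha) loss of prediction t under class probabilities p[0..n-1]."""
    n = len(p)
    return sum(p[y - 1] * abstain_loss(y, t, n, alpha) for y in range(1, n + 1))


def bayes_predict(p, alpha):
    """Predict the most likely class if its probability is at least 1 - alpha, else abstain."""
    n = len(p)
    best = max(range(n), key=lambda i: p[i])
    return best + 1 if p[best] >= 1 - alpha else n + 1


p = [0.4, 0.35, 0.25]
print(bayes_predict(p, 0.5))                             # abstains: no class reaches 0.5
print([expected_loss(p, t, 0.5) for t in range(1, 5)])   # abstaining has the smallest expected loss
```

With these probabilities no class has a simple majority, so abstaining (expected loss $0.5$) beats every class prediction (expected loss at least $0.6$).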


For small $\alpha$, the classifier $h^*_\alpha$ acts as a high-confidence classifier and would be useful in applications like medical diagnosis. For example, if one wishes to learn a classifier for diagnosing an illness with 80% confidence, and recommend further medical tests if it is not possible, the ideal classifier would be $h^*_{0.2}$, which is the minimizer of the abstain(0.2) loss. If $\alpha = \frac{1}{2}$, the Bayes classifier $h^*_\alpha$ has a very appealing structure: a class $y \in [n]$ is predicted only if the class $y$ has a simple majority. The abstain($\alpha$) loss is also useful in applications where a 'greater than $1 - \alpha$ conditional probability detector' can be used as a black box. For example, a greater than $\frac{1}{2}$ conditional probability detector plays a crucial role in hierarchical classification (Ramaswamy et al., 2015). (Details in supplementary material.)


Abstain($\alpha$) loss with $\alpha = \frac{1}{2}$ will be the main focus of our paper and will be the default when the abstain loss is referred to without any reference to $\alpha$. (As will be the case in Sections 3, 4, 5 and 7.)


Since the Bayes classifier $h^*_\alpha$ depends only on the conditional distribution of $Y \mid X$, any algorithm that gives a consistent estimator of the conditional probability of the classes, e.g. minimizing the one vs all squared loss (Ramaswamy & Agarwal, 2012; Vernet et al., 2011), can be made into a consistent algorithm (with a suitable change in the decision) for this problem.


However, smooth surrogates that estimate the conditional probability do much more than what is necessary to solve this problem. Consistent piecewise linear surrogate minimizing algorithms, on the other hand, do only what is needed and can be expected to be more successful. For example, least squares classification, logistic regression and SVM are all consistent for standard binary classification, but the SVM (which minimizes a piecewise linear hinge loss surrogate) is arguably the most widely used method. Piecewise linear surrogates have other advantages as well, like easier optimization and sparsity (in the dual); hence finding consistent piecewise linear surrogates for the abstain loss is an important and interesting task.


We show that the $n$-dimensional multiclass surrogate of Crammer and Singer (Crammer & Singer, 2001) and the simple one vs all hinge surrogate loss (Rifkin & Klautau, 2004) both yield a consistent algorithm for the abstain($\frac{1}{2}$) loss. It is interesting to note that both these surrogates are not consistent for the standard multiclass classification problem (Tewari & Bartlett, 2007; Lee et al., 2004; Zhang, 2004).

More interestingly, we construct a new convex piecewise linear surrogate, which we call the binary encoded predictions (BEP) surrogate, that operates on a $\log_2(n)$ dimensional space and yields a consistent algorithm for the $n$-class abstain($\frac{1}{2}$) loss. When optimized over comparable function classes, this algorithm is more efficient than the Crammer-Singer and one vs all algorithms because it only needs to find $\log_2(n)$ functions over the instance space, as opposed to $n$ functions. This result is surprising because it has been shown that one needs to minimize at least an $(n - 1)$-dimensional convex surrogate to get a consistent algorithm for the standard $n$-class problem, i.e. without the reject option (Ramaswamy & Agarwal, 2012). Also, the only known generic way of generating consistent surrogate minimizing algorithms for a given loss matrix (Ramaswamy & Agarwal, 2012), when applied to the $n$-class abstain loss, would give an $n$-dimensional surrogate here.


It is important to note the role of $\alpha$, the cost of abstaining. While conditional probability estimation based surrogates can be used for designing consistent algorithms for the $n$-class problem with the reject option with any $\alpha \in (0, \frac{n-1}{n})$, the Crammer-Singer surrogate, the one vs all hinge and the BEP surrogate and their corresponding variants all yield consistent algorithms only for $\alpha \in [0, \frac{1}{2}]$. While this may seem restrictive, we contend that these form an interesting and useful set of problems to solve. We also suspect that abstain($\alpha$) problems with $\alpha > \frac{1}{2}$ are fundamentally more difficult than those with $\alpha \le \frac{1}{2}$, for the reason that evaluating the Bayes classifier $h^*_\alpha(x)$ can be done for $\alpha \le \frac{1}{2}$ without finding the maximum conditional probability: just check if any class has conditional probability greater than $(1 - \alpha)$, as there can only be one. This is also evidenced by the more complicated partitions of the simplex induced by the Bayes optimal classifier for $\alpha > \frac{1}{2}$, as shown in Figure 1.


1.1. Overview 

We start with some preliminaries and notation in Section 2. In Section 3 we give excess risk bounds relating the excess Crammer-Singer multiclass surrogate risk and one vs all hinge surrogate risk to the excess abstain($\frac{1}{2}$) risk. In Section 4 we give our $\log_2(n)$ dimensional BEP surrogate, and give similar excess risk bounds. In Section 5 we frame the learning problem with the BEP surrogate as an optimization problem, derive its dual and give a block co-ordinate descent style algorithm for solving it. In Section 6 we give generalizations of the Crammer-Singer, one vs all hinge and BEP surrogates that are consistent for the abstain($\alpha$) loss for $\alpha \in [0, \frac{1}{2}]$. In Section 7 we include experimental results for all three algorithms. We conclude in Section 8 with a summary.

2. Preliminaries 

Let the instance space be $\mathcal{X}$, the finite set of class labels be $\mathcal{Y} = [n] = \{1, \ldots, n\}$, and the finite set of target labels be given by $\mathcal{T} = [n + 1] = \{1, \ldots, n + 1\}$. Given training examples $(X_1, Y_1), \ldots, (X_m, Y_m)$ drawn i.i.d. from a distribution $D$ on $\mathcal{X} \times \mathcal{Y}$, the goal is to learn a prediction model $h : \mathcal{X} \to \mathcal{T}$.

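For any given $\alpha \in [0, 1]$, the performance of a prediction model $h : \mathcal{X} \to \mathcal{T}$ is measured via the abstain($\alpha$) loss $\ell^\alpha : \mathcal{Y} \times \mathcal{T} \to \mathbb{R}_+$ from Equation (1). $\ell^\alpha(y, t)$ denotes the loss incurred on predicting $t$ when the truth is $y$. We will find it convenient to represent the loss function $\ell^\alpha : \mathcal{Y} \times \mathcal{T} \to \mathbb{R}_+$ as a loss matrix $L^\alpha \in \mathbb{R}_+^{n \times (n+1)}$ with elements $L^\alpha_{y,t} = \ell^\alpha(y, t)$ for $y \in [n], t \in [n + 1]$, and column vectors $\ell^\alpha_t = (L^\alpha_{1,t}, \ldots, L^\alpha_{n,t})^\top \in \mathbb{R}^n_+$ for $t \in [n + 1]$. The abstain($\alpha$) loss matrix and a schematic representation of the Bayes classifier for various values of $\alpha$ given by Equation (2) are given in Figure 1 for $n = 3$.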

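Specifically, the goal is to learn a model $h : \mathcal{X} \to \mathcal{T}$ with low expected loss or $\ell^\alpha$-error

$$\mathrm{er}^{\ell^\alpha}_D[h] = \mathbf{E}_{(X,Y) \sim D}\left[\ell^\alpha(Y, h(X))\right].$$

Ideally, one wants the $\ell^\alpha$-error of the learned model to be close to the optimal $\ell^\alpha$-error

$$\mathrm{er}^{\ell^\alpha,*}_D = \inf_{h : \mathcal{X} \to \mathcal{T}} \mathrm{er}^{\ell^\alpha}_D[h].$$

An algorithm, which outputs a (random) model $h_m : \mathcal{X} \to \mathcal{T}$ on being given a random training sample as above, is said to be consistent w.r.t. $\ell^\alpha$ if the $\ell^\alpha$-error of the learned model $h_m$ converges in probability to the optimal for all distributions $D$: $\mathrm{er}^{\ell^\alpha}_D[h_m] \xrightarrow{P} \mathrm{er}^{\ell^\alpha,*}_D$. Here the convergence in probability is over the learned classifier $h_m$ as a function of the training sample distributed i.i.d. according to $D$.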

However, minimizing the discrete $\ell^\alpha$-error directly is computationally difficult; therefore one uses instead a surrogate loss function $\psi : \mathcal{Y} \times \mathbb{R}^d \to \mathbb{R}_+$ (where $\mathbb{R}_+ = [0, \infty)$), for some $d \in \mathbb{Z}_+$, and learns a model $f : \mathcal{X} \to \mathbb{R}^d$ by minimizing (approximately, based on the training sample) the $\psi$-error

$$\mathrm{er}^{\psi}_D[f] = \mathbf{E}_{(X,Y) \sim D}\left[\psi(Y, f(X))\right].$$

Predictions on new instances $x \in \mathcal{X}$ are then made by applying the learned model $f$ and mapping back to predictions in the target space $\mathcal{T}$ via some mapping $\mathrm{pred} : \mathbb{R}^d \to \mathcal{T}$, giving $h(x) = \mathrm{pred}(f(x))$.


Under suitable conditions, algorithms that approximately minimize the $\psi$-error based on a training sample are known to be consistent with respect to $\psi$, i.e. to converge in probability to the optimal $\psi$-error

$$\mathrm{er}^{\psi,*}_D = \inf_{f : \mathcal{X} \to \mathbb{R}^d} \mathrm{er}^{\psi}_D[f].$$

Also, when $\psi$ is convex in its second argument, the resulting optimization problem is convex and can be efficiently solved.

Hence, we seek a surrogate and a predictor $(\psi, \mathrm{pred})$, with $\psi$ convex over its second argument, and satisfying a bound of the following form holding for all $f : \mathcal{X} \to \mathbb{R}^d$

$$\mathrm{er}^{\ell^\alpha}_D[\mathrm{pred} \circ f] - \mathrm{er}^{\ell^\alpha,*}_D \le \xi\left(\mathrm{er}^{\psi}_D[f] - \mathrm{er}^{\psi,*}_D\right)$$

where $\xi : \mathbb{R} \to \mathbb{R}$ is increasing, continuous at 0 and $\xi(0) = 0$. A surrogate and a predictor $(\psi, \mathrm{pred})$ satisfying such a bound, known as an excess risk transform bound, would immediately give an algorithm consistent w.r.t. $\ell^\alpha$ from an algorithm consistent w.r.t. $\psi$. We derive such bounds w.r.t. the $\ell^{1/2}$ loss for the Crammer-Singer surrogate, the one vs all hinge surrogate, and the BEP surrogate, with $\xi$ as a linear function.


3. Excess Risk Bounds for the Crammer-Singer and One vs All Hinge Surrogates


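In this section we give an excess risk bound relating the abstain loss $\ell$ and the Crammer-Singer surrogate $\psi^{\mathrm{CS}}$ (Crammer & Singer, 2001), and also the one vs all hinge loss.

Define the surrogate $\psi^{\mathrm{CS}} : [n] \times \mathbb{R}^n \to \mathbb{R}_+$ and predictor $\mathrm{pred}^{\mathrm{CS}}_\tau : \mathbb{R}^n \to [n + 1]$ as

$$\psi^{\mathrm{CS}}(y, u) = \left(\max_{j \ne y} u_j - u_y + 1\right)_+$$

$$\mathrm{pred}^{\mathrm{CS}}_\tau(u) = \begin{cases} \operatorname{argmax}_{i \in [n]} u_i & \text{if } u_{(1)} - u_{(2)} > \tau \\ n + 1 & \text{otherwise} \end{cases}$$

where $(a)_+ = \max(a, 0)$, $u_{(i)}$ is the $i$-th element of the components of $u$ when sorted in descending order and $\tau \in (0, 1)$ is a threshold parameter.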

Figure 1. (a) The abstain($\alpha$) loss matrix (with $n = 3$), with rows indexed by the true class $y$ and columns by the prediction $t$:

$$L^\alpha = \begin{pmatrix} 0 & 1 & 1 & \alpha \\ 1 & 0 & 1 & \alpha \\ 1 & 1 & 0 & \alpha \end{pmatrix}$$

(b,c,d) the partition of the simplex $\Delta_3$, depicting the optimal prediction for different conditional probabilities, induced by the Bayes classifier for the abstain($\alpha$) loss at three different values of $\alpha$, respectively.

We proceed further and also define the surrogate and predictor for the one vs all hinge loss. The surrogate $\psi^{\mathrm{OVA}} : [n] \times \mathbb{R}^n \to \mathbb{R}_+$ and predictor $\mathrm{pred}^{\mathrm{OVA}}_\tau : \mathbb{R}^n \to [n + 1]$ are defined as

$$\psi^{\mathrm{OVA}}(y, u) = \sum_{i=1}^n \mathbf{1}(y = i)(1 - u_i)_+ + \mathbf{1}(y \ne i)(1 + u_i)_+$$

$$\mathrm{pred}^{\mathrm{OVA}}_\tau(u) = \begin{cases} \operatorname{argmax}_{i \in [n]} u_i & \text{if } \max_j u_j > \tau \\ n + 1 & \text{otherwise} \end{cases}$$

where $(a)_+ = \max(a, 0)$ and $\tau \in (-1, 1)$ is a threshold parameter, and ties are broken arbitrarily, say, in favor of the label $y$ with the smaller index.
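The Crammer-Singer and one vs all hinge surrogates and their thresholded predictors can be written down in a few lines. The following Python sketch is only an illustration of the definitions in this section: the function names are hypothetical, scores are passed as a plain list `u` of length $n \ge 2$, classes are numbered from 1, and ties in the argmax go to the smaller index.

```python
def psi_cs(y, u):
    """Crammer-Singer surrogate: (max_{j != y} u_j - u_y + 1)_+ ."""
    best_other = max(u[j] for j in range(len(u)) if j != y - 1)
    return max(best_other - u[y - 1] + 1.0, 0.0)


def pred_cs(u, tau):
    """Predict the argmax if the gap between the top two scores exceeds tau, else abstain."""
    s = sorted(u, reverse=True)
    if s[0] - s[1] > tau:
        return u.index(s[0]) + 1      # index() returns the smallest maximizing index
    return len(u) + 1


def psi_ova(y, u):
    """One vs all hinge: (1 - u_y)_+ for the true class plus (1 + u_i)_+ for every other class."""
    return sum(max(1.0 - ui, 0.0) if i == y - 1 else max(1.0 + ui, 0.0)
               for i, ui in enumerate(u))


def pred_ova(u, tau):
    """Predict the argmax if the largest score exceeds tau, else abstain."""
    m = max(u)
    return u.index(m) + 1 if m > tau else len(u) + 1


u = [0.6, 0.4, -0.9]
print(pred_cs(u, 0.5), pred_ova(u, 0.0))   # CS abstains (gap 0.2), OVA predicts class 1
```

The example shows that the two predictors measure confidence differently: the Crammer-Singer predictor looks at the margin between the top two scores, while the one vs all predictor looks only at the top score.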

The following is the main result of this section, the proof of which is in Appendices A and B.

Theorem 1. Let $n \in \mathbb{N}$, $\tau_{\mathrm{CS}} \in (0, 1)$ and $\tau_{\mathrm{OVA}} \in (-1, 1)$. Then for all $f : \mathcal{X} \to \mathbb{R}^n$

$$\mathrm{er}^{\ell}_D[\mathrm{pred}^{\mathrm{CS}}_{\tau_{\mathrm{CS}}} \circ f] - \mathrm{er}^{\ell,*}_D \le \frac{\mathrm{er}^{\psi^{\mathrm{CS}}}_D[f] - \mathrm{er}^{\psi^{\mathrm{CS}},*}_D}{2 \min(\tau_{\mathrm{CS}}, 1 - \tau_{\mathrm{CS}})}$$

$$\mathrm{er}^{\ell}_D[\mathrm{pred}^{\mathrm{OVA}}_{\tau_{\mathrm{OVA}}} \circ f] - \mathrm{er}^{\ell,*}_D \le \frac{\mathrm{er}^{\psi^{\mathrm{OVA}}}_D[f] - \mathrm{er}^{\psi^{\mathrm{OVA}},*}_D}{2 (1 - |\tau_{\mathrm{OVA}}|)}$$

Remark: It has been pointed out previously by Zhang (2004) that if the data distribution $D$ is such that $\max_y p_x(y) > 0.5$ for all $x \in \mathcal{X}$, the Crammer-Singer surrogate and the one vs all hinge loss are consistent with the zero-one loss when used with the standard argmax predictor. Our Theorem 1 implies the above observation. However, it also gives more: in the case that the distribution does not satisfy the dominant class assumption, the model learned by using the surrogate and predictor $(\psi^{\mathrm{CS}}, \mathrm{pred}^{\mathrm{CS}}_\tau)$ or $(\psi^{\mathrm{OVA}}, \mathrm{pred}^{\mathrm{OVA}}_\tau)$ asymptotically still gives the right answer for instances having a dominant class, and fails in a graceful manner by abstaining for instances that do not have a dominant class.

4. Excess Risk Bounds for the BEP Surrogate

The Crammer-Singer surrogate and the one vs all hinge surrogate, just like surrogates designed for conditional probability estimation, are defined over an $n$-dimensional domain. Thus any algorithm that minimizes these surrogates must learn $n$ real valued functions over the instance space. In this section, we construct a $\lceil \log_2(n) \rceil$ dimensional convex surrogate, which we call the binary encoded predictions (BEP) surrogate, and give an excess risk bound relating this surrogate and the abstain loss. In particular these results show that the BEP surrogate is calibrated w.r.t. the abstain loss; this in turn implies that the convex calibration dimension (CC-dimension) (Ramaswamy & Agarwal, 2012) of the abstain loss is at most $\lceil \log_2(n) \rceil$.

For the purpose of simplicity let us assume $n = 2^d$ for some positive integer $d$.$^1$ Let $B : [n] \to \{+1, -1\}^d$ be any one-one and onto mapping, with an inverse mapping $B^{-1} : \{+1, -1\}^d \to [n]$. Define the BEP surrogate $\psi^{\mathrm{BEP}} : [n] \times \mathbb{R}^d \to \mathbb{R}_+$ and its corresponding predictor $\mathrm{pred}^{\mathrm{BEP}}_\tau : \mathbb{R}^d \to [n + 1]$ as

$$\psi^{\mathrm{BEP}}(y, u) = \left(\max_{j \in [d]} B_j(y) u_j + 1\right)_+$$

$$\mathrm{pred}^{\mathrm{BEP}}_\tau(u) = \begin{cases} n + 1 & \text{if } \min_{i \in [d]} |u_i| \le \tau \\ B^{-1}(\mathrm{sign}(-u)) & \text{otherwise} \end{cases}$$

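where $\mathrm{sign}(u)$ is the sign of $u$, with $\mathrm{sign}(0) = 1$, and $\tau \in (0, 1)$ is a threshold parameter.

Define the sets $\mathcal{U}^\tau_1, \ldots, \mathcal{U}^\tau_{n+1}$, where $\mathcal{U}^\tau_k = \{u \in \mathbb{R}^d : \mathrm{pred}^{\mathrm{BEP}}_\tau(u) = k\}$. These evaluate to

$$\mathcal{U}^\tau_y = \{u \in \mathbb{R}^d : \max_j B_j(y) u_j < -\tau\} \quad \text{for } y \in [n]$$

$$\mathcal{U}^\tau_{n+1} = \{u \in \mathbb{R}^d : \min_j |u_j| \le \tau\}$$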


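To make the above definition clear, we will see what the surrogate and predictor look like for the case of $n = 4$ and $\tau = \frac{1}{2}$. We have $d = 2$. Let us fix the mapping $B$ such that $B(y)$ is the standard $d$-bit binary representation of $(y - 1)$, with $-1$ in the place of 0. Then we have

$$\psi^{\mathrm{BEP}}(1, u) = (\max(-u_1, -u_2) + 1)_+$$
$$\psi^{\mathrm{BEP}}(2, u) = (\max(-u_1, u_2) + 1)_+$$
$$\psi^{\mathrm{BEP}}(3, u) = (\max(u_1, -u_2) + 1)_+$$
$$\psi^{\mathrm{BEP}}(4, u) = (\max(u_1, u_2) + 1)_+$$

$$\mathrm{pred}^{\mathrm{BEP}}_{\frac{1}{2}}(u) = \begin{cases} 1 & \text{if } u_1 > \frac{1}{2},\ u_2 > \frac{1}{2} \\ 2 & \text{if } u_1 > \frac{1}{2},\ u_2 < -\frac{1}{2} \\ 3 & \text{if } u_1 < -\frac{1}{2},\ u_2 > \frac{1}{2} \\ 4 & \text{if } u_1 < -\frac{1}{2},\ u_2 < -\frac{1}{2} \\ 5 & \text{otherwise} \end{cases}$$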
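The $n = 4$ example can be checked mechanically. The following Python sketch implements the BEP surrogate and predictor for the particular encoding used in the example (the binary representation of $y - 1$, most significant bit first, with $-1$ in place of 0). The function names are hypothetical, and reading "standard $d$-bit binary representation" as most-significant-bit-first is an assumption chosen because it reproduces the four formulas above.

```python
def encode(y, d):
    """B(y): d-bit binary representation of y - 1 (MSB first), with -1 in place of 0."""
    return [1 if ((y - 1) >> (d - 1 - j)) & 1 else -1 for j in range(d)]


def decode(bits):
    """B^{-1}: map a vector in {+1, -1}^d back to a class in 1..2^d."""
    idx = 0
    for b in bits:
        idx = 2 * idx + (1 if b == 1 else 0)
    return idx + 1


def psi_bep(y, u):
    """BEP surrogate: (max_j B_j(y) u_j + 1)_+ ."""
    b = encode(y, len(u))
    return max(max(bj * uj for bj, uj in zip(b, u)) + 1.0, 0.0)


def pred_bep(u, tau):
    """Abstain if any coordinate is within tau of zero, else decode sign(-u)."""
    if min(abs(v) for v in u) <= tau:
        return 2 ** len(u) + 1
    return decode([1 if -v >= 0 else -1 for v in u])   # sign(0) = 1 by convention


for u in ([1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0], [0.3, 1.0]):
    print(u, pred_bep(u, 0.5))   # classes 1, 2, 3, 4 and then the abstain option 5
```

Note that the predicted class is the one whose code is the negation of the sign pattern of $u$, which is why a confident prediction of class $y$ corresponds to $B_j(y) u_j < -\tau$ in every coordinate and to zero surrogate loss when those products are at most $-1$.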


$^1$If $n$ is not a power of 2, just add enough dummy classes that never occur.

Figure 2. The partition of $\mathbb{R}^2$ induced by $\mathrm{pred}^{\mathrm{BEP}}_{\frac{1}{2}}$.

Figure 2 gives the partition induced by the predictor $\mathrm{pred}^{\mathrm{BEP}}_{\frac{1}{2}}$.

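The following is the main result of this section, the proof of which is in Appendix C.

Theorem 2. Let $n \in \mathbb{N}$ and $\tau \in (0, 1)$. Let $n = 2^d$. Then for all $f : \mathcal{X} \to \mathbb{R}^d$

$$\mathrm{er}^{\ell}_D[\mathrm{pred}^{\mathrm{BEP}}_\tau \circ f] - \mathrm{er}^{\ell,*}_D \le \frac{\mathrm{er}^{\psi^{\mathrm{BEP}}}_D[f] - \mathrm{er}^{\psi^{\mathrm{BEP}},*}_D}{2 \min(\tau, 1 - \tau)}$$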


Remark: The excess risk bounds for the CS, OVA, and BEP surrogates suggest that $\tau = \frac{1}{2}$ is the best choice for the CS and BEP surrogates, while $\tau = 0$ is the best choice for the OVA surrogate. However, intuitively $\tau$ is the threshold converting confidence values to predictions, and so it makes sense to use $\tau$ values closer to 0 (or $-1$ in the case of OVA) to predict aggressively in low-noise situations, and use larger $\tau$ to predict conservatively in noisy situations. Practically, it makes sense to choose the parameter $\tau$ via cross-validation.


5. BEP Surrogate Optimization Algorithm 

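In this section we frame the problem of finding the linear (vector valued) function that minimizes the BEP surrogate loss over a training set $\{(x_i, y_i)\}_{i=1}^m$, with feature vectors $x_i$ and labels $y_i \in [n]$, as a convex optimization problem. Once again, for simplicity we assume that the size of the label space is $n = 2^d$ for some $d \in \mathbb{Z}_+$. The primal and dual of the resulting optimization problem with a norm squared regularizer are given below.

Primal problem:

$$\min_{w_1, \ldots, w_d,\ \xi_1, \ldots, \xi_m} \ \sum_{i=1}^m \xi_i + \frac{\lambda}{2} \sum_{j=1}^d \|w_j\|^2$$

such that $\forall i \in [m], j \in [d]$

$$\xi_i \ge B_j(y_i)\, w_j^\top x_i + 1, \qquad \xi_i \ge 0.$$

Dual problem:

$$\max_{\alpha \in \mathbb{R}^{m \times (d+1)}} \ -\sum_{i=1}^m \alpha_{i,0} - \frac{1}{2\lambda} \sum_{i=1}^m \sum_{i'=1}^m \langle x_i, x_{i'} \rangle\, \mu_{i,i'}(\alpha)$$

such that $\forall i \in [m], j \in [d] \cup \{0\}$

$$\alpha_{i,j} \ge 0, \qquad \sum_{j'=0}^d \alpha_{i,j'} = 1,$$

where $\mu_{i,i'}(\alpha) = \sum_{j=1}^d B_j(y_i) B_j(y_{i'})\, \alpha_{i,j}\, \alpha_{i',j}$.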


We optimize the dual as it can be easily extended to work with kernels. The structure of the constraints in the dual lends itself easily to a block co-ordinate ascent algorithm, where we optimize over $\{\alpha_{i,j} : j \in \{0, \ldots, d\}\}$ and fix every other variable in each iteration. Such methods have been recently proven to have an exponential convergence rate for SVM-type problems (Wang & Lin, 2014), and we expect results of that type to apply to our problem as well.


The problem to be solved at every iteration reduces to an $\ell_2$ projection of a vector $g^i$ onto the set $\mathcal{S}^i = \{g : g^\top b^i \le 1\}$, where $b^i \in \{\pm 1\}^d$ is such that $b^i_j = B_j(y_i)$. The projection problem is a simple variant of projecting a vector onto the $\ell_1$ ball of radius 1, which can be solved efficiently in $O(d)$ time (Duchi et al., 2008). The vector $g^i$ is such that for any $j \in [d]$,


V 7 = 
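To give a feel for the kind of projection step involved, the following Python sketch implements the standard sort-based Euclidean projection onto the set $\{a \in \mathbb{R}^d : a_j \ge 0, \sum_j a_j \le 1\}$, the nonnegative part of the $\ell_1$ ball of radius 1. This is only the generic building block from the $\ell_1$-projection literature, not the paper's exact per-iteration update: the function name is hypothetical, the mapping between this set and $\mathcal{S}^i$ is not spelled out here, and the sort makes it $O(d \log d)$ rather than the $O(d)$ expected-time variant cited in the text.

```python
def project_capped_simplex(v):
    """Euclidean projection of v onto {a : a_j >= 0, sum_j a_j <= 1}."""
    clipped = [max(x, 0.0) for x in v]
    if sum(clipped) <= 1.0:
        # Projecting onto the nonnegative orthant already lands inside the set.
        return clipped
    # Otherwise the sum constraint is active: project onto the simplex sum_j a_j = 1.
    u = sorted(v, reverse=True)
    cumsum, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        t = (cumsum - 1.0) / j
        if uj - t > 0:          # holds exactly for the leading coordinates that stay positive
            theta = t
    return [max(x - theta, 0.0) for x in v]


print(project_capped_simplex([2.0, 0.0]))   # mass is pushed back onto the first coordinate
print(project_capped_simplex([1.0, 1.0]))   # symmetric input is shrunk symmetrically
```

Because each block of dual variables is handled by one such closed-form projection, a block co-ordinate ascent pass costs roughly one projection per training example.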





6. Abstain($\alpha$) Loss for $\alpha < \frac{1}{2}$

The excess risk bounds derived for the CS, OVA hinge loss and BEP surrogates apply only to the abstain($\frac{1}{2}$) loss. But it is possible to derive such excess risk bounds for abstain($\alpha$) with $\alpha \in [0, \frac{1}{2}]$ with slight modifications to the CS, OVA and BEP surrogates.

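Define $\psi^{\mathrm{CS},\alpha} : [n] \times \mathbb{R}^n \to \mathbb{R}_+$, $\psi^{\mathrm{OVA},\alpha} : [n] \times \mathbb{R}^n \to \mathbb{R}_+$ and $\psi^{\mathrm{BEP},\alpha} : [n] \times \mathbb{R}^d \to \mathbb{R}_+$ with $n = 2^d$ as

$$\psi^{\mathrm{CS},\alpha}(y, u) = 2 \cdot \max\left(\alpha \max_{j \ne y} \gamma(u_j - u_y),\ (1 - \alpha) \max_{j \ne y} \gamma(u_j - u_y)\right) + 2\alpha$$

$$\psi^{\mathrm{OVA},\alpha}(y, u) = 2 \cdot \sum_{i=1}^n \left(\mathbf{1}(y = i)\, \alpha\, (1 - u_i)_+ + \mathbf{1}(y \ne i)(1 - \alpha)(1 + u_i)_+\right)$$

$$\psi^{\mathrm{BEP},\alpha}(y, u) = 2 \cdot \max\left(\alpha \max_{j \in [d]} \gamma(B_j(y) u_j),\ (1 - \alpha) \max_{j \in [d]} \gamma(B_j(y) u_j)\right) + 2\alpha$$

where $\gamma(a) = \max(a, -1)$ and $B : [n] \to \{-1, 1\}^d$ is any bijection. Note that $\psi^{\mathrm{CS},\frac{1}{2}} = \psi^{\mathrm{CS}}$, $\psi^{\mathrm{OVA},\frac{1}{2}} = \psi^{\mathrm{OVA}}$ and $\psi^{\mathrm{BEP},\frac{1}{2}} = \psi^{\mathrm{BEP}}$.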
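The claim that the generalized surrogates reduce to the original ones at $\alpha = \frac{1}{2}$ is easy to sanity-check numerically. The sketch below does so for the Crammer-Singer variant, reading the definition as $2 \max(\alpha m, (1 - \alpha) m) + 2\alpha$ with $m = \max_{j \ne y} \gamma(u_j - u_y)$; the function names are hypothetical and the reading of the formula is an assumption based on the definition in this section.

```python
def gamma(a):
    """Clip from below at -1."""
    return max(a, -1.0)


def psi_cs(y, u):
    """Original Crammer-Singer surrogate (max_{j != y} u_j - u_y + 1)_+ ."""
    return max(max(u[j] for j in range(len(u)) if j != y - 1) - u[y - 1] + 1.0, 0.0)


def psi_cs_alpha(y, u, alpha):
    """Generalized Crammer-Singer surrogate for the abstain(alpha) loss."""
    m = max(gamma(u[j] - u[y - 1]) for j in range(len(u)) if j != y - 1)
    return 2.0 * max(alpha * m, (1.0 - alpha) * m) + 2.0 * alpha


u = [0.7, -0.2, 0.1]
for y in (1, 2, 3):
    print(y, psi_cs(y, u), psi_cs_alpha(y, u, 0.5))   # the two columns agree
```

For $\alpha < \frac{1}{2}$ the same function has slope $2\alpha$ on $m \in [-1, 0]$ and slope $2(1 - \alpha)$ on $m \ge 0$, which is the three-segment "double hinge" shape described in the introduction.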

One can show the following theorem, which is a generalization of Theorems 1 and 2. The proof proceeds along the same lines as the proofs of Theorems 1 and 2 and is hence omitted.

Theorem 3. Let $n \in \mathbb{N}$, $\tau \in (0, 1)$, $\tau' \in (-1, 1)$ and $\alpha \in [0, \frac{1}{2}]$. Let $n = 2^d$. Then for all $f : \mathcal{X} \to \mathbb{R}^d$, $g : \mathcal{X} \to \mathbb{R}^n$

$$\mathrm{er}^{\ell^\alpha}_D[\mathrm{pred}^{\mathrm{CS}}_\tau \circ g] - \mathrm{er}^{\ell^\alpha,*}_D \le \frac{1}{2 \min(\tau, 1 - \tau)} \left(\mathrm{er}^{\psi^{\mathrm{CS},\alpha}}_D[g] - \mathrm{er}^{\psi^{\mathrm{CS},\alpha},*}_D\right)$$

$$\mathrm{er}^{\ell^\alpha}_D[\mathrm{pred}^{\mathrm{OVA}}_{\tau'} \circ g] - \mathrm{er}^{\ell^\alpha,*}_D \le \frac{1}{2 (1 - |\tau'|)} \left(\mathrm{er}^{\psi^{\mathrm{OVA},\alpha}}_D[g] - \mathrm{er}^{\psi^{\mathrm{OVA},\alpha},*}_D\right)$$

$$\mathrm{er}^{\ell^\alpha}_D[\mathrm{pred}^{\mathrm{BEP}}_\tau \circ f] - \mathrm{er}^{\ell^\alpha,*}_D \le \frac{1}{2 \min(\tau, 1 - \tau)} \left(\mathrm{er}^{\psi^{\mathrm{BEP},\alpha}}_D[f] - \mathrm{er}^{\psi^{\mathrm{BEP},\alpha},*}_D\right)$$

Remark: When $n = 2$, the Crammer-Singer surrogate, the one vs all hinge and the BEP surrogate all reduce to the hinge loss and $\alpha$ is restricted to be at most $\frac{1}{2}$ to ensure the relevance of the abstain option. Applying the above extension for $\alpha \le \frac{1}{2}$ to the hinge loss, we get the 'generalized hinge loss' of Bartlett and Wegkamp (2008).

7. Experimental Results

In this section we give our experimental results for the algorithms proposed on both synthetic and real datasets.

7.1. Synthetic Data

We optimize the Crammer-Singer surrogate, the one vs all hinge surrogate and the BEP surrogate over appropriate kernel spaces on a 2-dimensional 8 class synthetic data set and show that the abstain($\frac{1}{2}$) loss incurred by the trained model for all three algorithms approaches the Bayes optimal under various thresholds.

The dataset we used was generated as follows. We randomly sample 8 prototype vectors $v_1, \ldots, v_8 \in \mathbb{R}^2$, with each $v_y$ drawn independently from a zero mean unit variance 2D-Gaussian, $\mathcal{N}(0, I_2)$ distribution. These 8 prototype vectors correspond to the 8 classes. Each example $(x, y)$ is generated by first picking $y$ from one of the 8 classes uniformly at random, and the instance $x$ is set as $x = v_y + 0.65 \cdot u$, where $u$ is independently drawn from $\mathcal{N}(0, I_2)$. We generated 12800 such $(x, y)$ pairs for training, and another 10000 instances for testing.

The CS, OVA and BEP surrogates were all optimized over a reproducing kernel Hilbert space (RKHS) with a Gaussian kernel and the standard norm-squared regularizer. The kernel width parameter and the regularization parameter were chosen by grid search using a separate validation set.$^2$

As Figure 3 indicates, the expected abstain risk incurred by the trained model approaches the Bayes risk with increasing training data for all three algorithms and intermediate $\tau$ values. The excess risk bounds in Theorems 1 and 2 break down when the threshold parameter $\tau \in \{0, 1\}$ for the CS and BEP surrogates, and when $\tau \in \{-1, 1\}$ for the OVA surrogate. This is supported by the observation that, in Figure 3, the curves corresponding to these thresholds perform poorly. In particular, using $\tau = 0$ for the CS and BEP algorithms implies that the resulting algorithms never abstain.

Though all three surrogate minimizing algorithms we consider are consistent w.r.t. the abstain loss, we find that the BEP and OVA algorithms use less computation time and fewer samples than the CS algorithm to attain the same error. However, the BEP surrogate performs poorly when optimized over a linear function class (experiments not shown here), due to its much restricted representation power.
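The synthetic data generation described in this section (8 Gaussian prototypes in the plane, uniformly chosen class, isotropic noise scaled by 0.65) can be reproduced with a few lines of standard-library Python. This is a minimal sketch under stated assumptions: the function name, the seed, and the use of Python's `random` module are illustrative choices and not the setup used for the reported experiments.

```python
import random


def make_dataset(m, prototypes, noise=0.65, rng=random):
    """Draw m examples (x, y): y uniform over the classes, x = v_y + noise * N(0, I_2)."""
    data = []
    for _ in range(m):
        k = rng.randrange(len(prototypes))          # 0-based class index
        vx, vy = prototypes[k]
        x = (vx + noise * rng.gauss(0.0, 1.0),
             vy + noise * rng.gauss(0.0, 1.0))
        data.append((x, k + 1))                     # classes are numbered 1..8
    return data


rng = random.Random(0)                              # fixed seed, chosen arbitrarily
prototypes = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(8)]
train = make_dataset(12800, prototypes, rng=rng)
test = make_dataset(10000, prototypes, rng=rng)
print(len(train), len(test))
```

Because the class-conditional densities are known Gaussians here, the conditional probabilities $p_x(y)$, and hence the Bayes optimal abstain risk that the learned models are compared against, can be computed in closed form for this dataset.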


7.2. Real Data

We ran experiments on real multiclass datasets from the UCI repository, the details of which are in Table 2. In each of these datasets, if a train/test split is not indicated in the dataset we make one ourselves by splitting at random.

All three algorithms (CS, OVA and BEP) were optimized over an RKHS with a Gaussian kernel and the standard norm-squared regularizer. The kernel width and regularization parameters were chosen through validation: 10-fold cross-validation in the case of the satimage, yeast, vehicle and image datasets, and a 75-25 split of the train set into train and validation for the letter and covertype datasets. For simplicity we set $\tau = 0$ (or $\tau = -1$ for OVA) during the validation phase.

$^2$We used Joachims' SVM-light package (Joachims, 1999) for the OVA and CS algorithms.

Figure 3. (a) Performance of the CS surrogate for various thresholds as a function of training size. (b) Performance of the OVA surrogate for various thresholds as a function of training size. (c) Performance of the BEP surrogate for various thresholds as a function of training size.

Table 1. Error percentages of the three algorithms when the rejection percentage is fixed at 0%, 20% and 40%.

Reject:      0%                     20%                    40%
Algorithm:   CS      OVA    BEP     CS      OVA    BEP     CS      OVA    BEP
satimage     10.25   8.3    8.15    5.6     2.5    2.4     2.9     0.9    0.6
yeast        44.4    38.8   42.7    34.5    26     29.7    24      17     19.8
letter       4.8     2.8    4.6     1.4     0.1    0.6     0.4     0      0.1
vehicle      31.5    17.1   20.5    24.6    8.2    13      16.4    5.5    6.1
image        5.8     5.1    4.2     2.2     1.6    1.6     0.6     0.6    0.3
covertype    32.2    28.1   29.4    23.6    19.3   20.4    16.3    11.7   12.8

Table 2. Details of datasets used.

             # Train   # Test    # Feat   # Class
satimage     4,435     2,000     36       6
yeast        1,000     484       8        10
letter       16,000    4,000     16       26
vehicle      700       146       18       4
image        2,000     310       19       7
covertype    15,120    565,892   54       7

The results of the experiment with the CS, OVA and BEP algorithms are given in Table 1. The rejection rate is fixed at some given level by choosing the threshold $\tau$ for each algorithm and dataset appropriately. As can be seen from the table, the BEP algorithm's performance is comparable to the OVA, and is better than the CS algorithm. However, Table 3, which gives the training times for the algorithms, reveals that the BEP algorithm runs the fastest, thus making the BEP algorithm a good option for large datasets. The main reason for the observed speedup of the BEP is that it learns only $\log_2(n)$ functions for an $n$-class problem, and hence the speedup factor of the BEP over the OVA would potentially be better for larger $n$.

Table 3. Time taken for learning the final model and making predictions on the test set (does not include validation time).

Algorithm    CS        OVA       BEP
satimage     2153s     76s       44s
yeast        5s        7s        2s
letter       9608s     1055s     313s
vehicle      3s        3s        1s
image        222s      16s       6s
covertype    47974s    23709s    6786s
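The text does not spell out how the threshold is chosen to hit a given rejection rate; one simple way, sketched below in Python, is to take an empirical quantile of a per-example confidence score. Everything in this sketch is an assumption for illustration: the function names are hypothetical, and the confidence score is taken to be the quantity each predictor compares against $\tau$ (the top-two gap $u_{(1)} - u_{(2)}$ for CS, $\max_j u_j$ for OVA, $\min_j |u_j|$ for BEP).

```python
def threshold_for_reject_rate(confidences, rate):
    """Return tau such that about `rate` of the examples have confidence <= tau."""
    s = sorted(confidences)
    k = int(round(rate * len(s)))        # number of examples to reject
    if k == 0:
        return float("-inf")             # a threshold no example can fall below
    return s[k - 1]


def reject_fraction(confidences, tau):
    """Fraction of examples on which the predictor would abstain at threshold tau."""
    return sum(1 for c in confidences if c <= tau) / len(confidences)


conf = [0.1, 0.9, 0.5, 0.3, 0.7]         # e.g. min_j |u_j| for five test points under BEP
tau = threshold_for_reject_rate(conf, 0.4)
print(tau, reject_fraction(conf, tau))
```

When several examples share the same confidence value, the realized rejection rate can exceed the target slightly, so the achieved rate is worth reporting alongside the error.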

8. Conclusion 

The multiclass classification problem with a reject option is a powerful abstraction that captures controlling the uncertainty of the classifier and is very useful in applications like medical diagnosis. We formalized this problem via an evaluation metric, called the abstain loss, and gave excess risk bounds relating the abstain loss to the Crammer-Singer surrogate, the one vs all hinge surrogate and also to the BEP surrogate, which is a new surrogate and operates on a much smaller dimension. Extending these results to other such evaluation metrics, in particular the abstain($\alpha$) loss for $\alpha > \frac{1}{2}$, is an interesting future direction.
Appendix


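We break the proof of Theorem 1 into two parts consisting of the proof of excess risk bounds for the CS surrogate and the OVA surrogate respectively.

A. Proof of Excess Risk Bounds for the Crammer-Singer Surrogate

Define the sets $\mathcal{U}^\tau_1, \ldots, \mathcal{U}^\tau_{n+1}$ such that $\mathcal{U}^\tau_i$ is the set of vectors $u$ in $\mathbb{R}^n$ for which $\mathrm{pred}^{\mathrm{CS}}_\tau(u) = i$:

$$\mathcal{U}^\tau_y = \{u \in \mathbb{R}^n : u_y > u_j + \tau \text{ for all } j \ne y\}, \quad y \in [n]$$

$$\mathcal{U}^\tau_{n+1} = \{u \in \mathbb{R}^n : u_{(1)} \le u_{(2)} + \tau\}.$$

The following lemma gives some crucial, but straightforward to prove, (in)equalities satisfied by the Crammer-Singer surrogate.

Lemma 4. $\forall y \in [n], \forall p \in \Delta_n$

$$p^\top \psi^{\mathrm{CS}}(e_y) = 2(1 - p_y), \qquad (3)$$

$$p^\top \psi^{\mathrm{CS}}(\mathbf{0}) = 1, \qquad (4)$$

$\forall u \in \mathbb{R}^n, \forall y \in \operatorname{argmax}_i u_i, \forall y' \notin \operatorname{argmax}_i u_i$

$$\psi^{\mathrm{CS}}(y, u) \ge u_{(2)} - u_{(1)} + 1, \qquad (5)$$

$$\psi^{\mathrm{CS}}(y', u) \ge u_{(1)} - u_{(2)} + 1, \qquad (6)$$

where $e_y$ is the vector in $\mathbb{R}^n$ with 1 in the $y$-th position and 0 everywhere else.

The part of Theorem 1 proved here is restated below.

Theorem. Let $n \in \mathbb{N}$ and $\tau \in (0, 1)$. Then for all $f : \mathcal{X} \to \mathbb{R}^n$

$$\mathrm{er}^{\ell}_D[\mathrm{pred}^{\mathrm{CS}}_\tau \circ f] - \mathrm{er}^{\ell,*}_D \le \frac{\mathrm{er}^{\psi^{\mathrm{CS}}}_D[f] - \mathrm{er}^{\psi^{\mathrm{CS}},*}_D}{2 \min(\tau, 1 - \tau)}$$

Proof. We will show that $\forall p \in \Delta_n$ and all $u \in \mathbb{R}^n$

$$p^\top \psi^{\mathrm{CS}}(u) - \inf_{u' \in \mathbb{R}^n} p^\top \psi^{\mathrm{CS}}(u') \ge 2 \min(\tau, 1 - \tau) \left(p^\top \ell_{\mathrm{pred}^{\mathrm{CS}}_\tau(u)} - \min_t p^\top \ell_t\right). \qquad (7)$$

The Theorem simply follows from linearity of expectation.

Case 1: $p_y \ge \frac{1}{2}$ for some $y \in [n]$.

We have that $y \in \operatorname{argmin}_t p^\top \ell_t$.

Case 1a: $u \in \mathcal{U}^\tau_y$

The RHS of Equation (7) is zero, and hence the inequality becomes trivial.

Case 1b: $u \in \mathcal{U}^\tau_{n+1}$

We have that $u_{(1)} - u_{(2)} \le \tau$. Let $q = \sum_{i \in \operatorname{argmax}_j u_j} p_i$. We then have

$$p^\top \psi^{\mathrm{CS}}(u) - p^\top \psi^{\mathrm{CS}}(e_y) \ge q\,(u_{(2)} - u_{(1)} + 1) + (1 - q)(u_{(1)} - u_{(2)} + 1) - 2(1 - p_y)$$
$$= (2q - 1)(u_{(2)} - u_{(1)}) - 1 + 2p_y$$
$$\ge (2p_y - 1)(1 - \tau). \qquad (8)$$

The last inequality follows from $u_{(2)} - u_{(1)} \ge -\tau$ and the following observations. If $q > p_y$ then $u_{(1)} = u_{(2)}$, and if $q < p_y$ we have $q \le \frac{1}{2}$.

$$p^\top \ell_{\mathrm{pred}^{\mathrm{CS}}_\tau(u)} - \min_t p^\top \ell_t = p^\top \ell_{n+1} - p^\top \ell_y = p_y - \frac{1}{2} \qquad (9)$$

From Equations (8) and (9) we have

$$p^\top \psi^{\mathrm{CS}}(u) - \inf_{u' \in \mathbb{R}^n} p^\top \psi^{\mathrm{CS}}(u') \ge 2(1 - \tau)\left(p^\top \ell_{\mathrm{pred}^{\mathrm{CS}}_\tau(u)} - \min_t p^\top \ell_t\right) \qquad (10)$$

Case 1c: $u \in \mathbb{R}^n \setminus (\mathcal{U}^\tau_y \cup \mathcal{U}^\tau_{n+1})$

We have $\mathrm{pred}^{\mathrm{CS}}_\tau(u) = y' \ne y$. Also $p_{y'} \le 1 - p_y \le \frac{1}{2}$ and $u_{(1)} = u_{y'} > u_{(2)} + \tau$.

$$p^\top \psi^{\mathrm{CS}}(u) - p^\top \psi^{\mathrm{CS}}(e_y) = \left(\sum_{i=1; i \ne y'}^n p_i \psi^{\mathrm{CS}}(i, u) + p_{y'} \psi^{\mathrm{CS}}(y', u)\right) - 2(1 - p_y)$$
$$\ge (1 - 2p_{y'})(u_{y'} - u_{(2)}) - 1 + 2p_y$$
$$\ge 2\tau(p_y - p_{y'}) \quad \text{(from the conditions of Case 1c)} \qquad (11)$$

We also have that

$$p^\top \ell_{\mathrm{pred}^{\mathrm{CS}}_\tau(u)} - \min_t p^\top \ell_t = p^\top \ell_{y'} - p^\top \ell_y = p_y - p_{y'} \qquad (12)$$

From Equations (11) and (12) we have

$$p^\top \psi^{\mathrm{CS}}(u) - \inf_{u' \in \mathbb{R}^n} p^\top \psi^{\mathrm{CS}}(u') \ge 2\tau\left(p^\top \ell_{\mathrm{pred}^{\mathrm{CS}}_\tau(u)} - \min_t p^\top \ell_t\right) \qquad (13)$$

Case 2: $p_{y'} < \frac{1}{2}$ for all $y' \in [n]$

We have that $n + 1 \in \operatorname{argmin}_t p^\top \ell_t$.

Case 2a: $u \in \mathcal{U}^\tau_{n+1}$ (or $\mathrm{pred}^{\mathrm{CS}}_\tau(u) = n + 1$)

The RHS of Equation (7) is zero, and hence the inequality becomes trivial.

Case 2b: $u \in \mathbb{R}^n \setminus \mathcal{U}^\tau_{n+1}$ (or $\mathrm{pred}^{\mathrm{CS}}_\tau(u) \ne n + 1$)

Let $\mathrm{pred}^{\mathrm{CS}}_\tau(u) = \operatorname{argmax}_i u_i = y$. We have that $u_{(1)} = u_y > u_{(2)} + \tau$ and $p_y < \frac{1}{2}$.

$$p^\top \psi^{\mathrm{CS}}(u) - p^\top \psi^{\mathrm{CS}}(\mathbf{0}) = \left(\sum_{i=1; i \ne y}^n p_i \psi^{\mathrm{CS}}(i, u) + p_y \psi^{\mathrm{CS}}(y, u)\right) - 1$$
$$\ge (1 - 2p_y)(u_{(1)} - u_{(2)})$$
$$\ge (1 - 2p_y)\,\tau \quad \text{(from the conditions of Case 2b)} \qquad (14)$$

We also have that

$$p^\top \ell_{\mathrm{pred}^{\mathrm{CS}}_\tau(u)} - \min_t p^\top \ell_t = p^\top \ell_y - p^\top \ell_{n+1} = \frac{1}{2} - p_y \qquad (15)$$

From Equations (14) and (15) we have

$$p^\top \psi^{\mathrm{CS}}(u) - \inf_{u' \in \mathbb{R}^n} p^\top \psi^{\mathrm{CS}}(u') \ge 2\tau\left(p^\top \ell_{\mathrm{pred}^{\mathrm{CS}}_\tau(u)} - \min_t p^\top \ell_t\right) \qquad (16)$$

Equation (7), and hence the Theorem, follows from Equations (10), (13) and (16). □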


B. Proof of Excess Risk Bounds for the One vs 
All Hinge Surrogate 

Define the sets Ui,..., lAn+i such that Ui is the set of vec¬ 
tors u in K", for which pred°'''^(u) = i 

UI = {ugW^ :Uy>T,y = argmax,g[„]uj, y G [n] 
U^_^_i = {u G K" : < T for all j G [n]}. 


The following lemma gives some crucial, but straightfor¬ 
ward to prove, (in)equalities satisfied by the OVA hinge 
surrogate. 

Lemma 5. 

Vy G [n], Vp G A„ , Vu G M" 
pTt/,0''A(2 . - 1) = 4(1-p„) (17) 

pT^OVA(_^) = 2 (18) 

^ Uj — 2uy + n (19) 
16["] 

where By is the vector in K" with 1 in the position and 
0 everywhere else. 


The part of Theorem[T]proved here is restated below. 
Theorem. Let n G N and t G (0,1). Then for all f : 


er^ [pred^ ^ ° f] — ef 


D 


< 


2 ( 1 - 


(^er^ [f] -erf, ’ j 


Proof. We will show that Vp G A„ and all u G [—1,1]'^ 
p^i/:°'^^(u) — inf p^'0°'*^(u') 

> 2(1 - |t|)(p^^p,,jova(u) - mmp^^t) ( 20 ) 

the Theorem simply follows from the observation that for 
all u G K" clipping the components of u to [—1,1] does 
not increase u) for any y, and by linearity of ex¬ 

pectation. 

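Case 1: $p_y \ge \frac{1}{2}$ for some $y \in [n]$.

We have that $y \in \operatorname{argmin}_t p^\top \ell_t$.

Case 1a: $\mathbf{u} \in [-1,1]^n \cap \mathcal{U}^\tau_y$

The RHS of equation (20) is zero, and hence the inequality becomes trivial.

Case 1b: $\mathbf{u} \in [-1,1]^n \cap \mathcal{U}^\tau_{n+1}$

We have that $\max_j u_j \le \tau$.

$$
\begin{aligned}
p^\top \psi^{\mathrm{OVA}}(\mathbf{u})
&\overset{(19)}{\ge} \sum_{i=1}^{n} (1 - 2p_i)\, u_i + n \\
&\ge \sum_{i \in [n] \setminus \{y\}} (1 - 2p_i)\, u_i + (2p_y - 1)(-\tau) + n \\
&\ge \sum_{i \in [n]} (2p_i - 1) + (2p_y - 1)(-\tau - 1) + n \\
&= (2p_y - 1)(-\tau - 1) + 2
\end{aligned}
$$

And hence we have

$$
\begin{aligned}
p^\top \psi^{\mathrm{OVA}}(\mathbf{u}) - p^\top \psi^{\mathrm{OVA}}(2 \cdot \mathbf{e}_y - \mathbf{1})
&\ge (2p_y - 1)(-\tau - 1) + 2 - 4(1 - p_y) \\
&= (2p_y - 1)(1 - \tau)
\end{aligned}
\tag{21}
$$

We also have

$$
p^\top \ell_{\mathrm{pred}^{\mathrm{OVA}}_\tau(\mathbf{u})} - \min_t p^\top \ell_t = p^\top \ell_{n+1} - p^\top \ell_y = p_y - \tfrac{1}{2}
\tag{22}
$$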

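From Equations (21) and (22) we have for all $\mathbf{u} \in [-1,1]^n \cap \mathcal{U}^\tau_{n+1}$

$$
p^\top \psi^{\mathrm{OVA}}(\mathbf{u}) - \inf_{\mathbf{u}'} p^\top \psi^{\mathrm{OVA}}(\mathbf{u}')
\ge 2(1 - \tau) \big( p^\top \ell_{\mathrm{pred}^{\mathrm{OVA}}_\tau(\mathbf{u})} - \min_t p^\top \ell_t \big)
\tag{23}
$$

Case 1c: $\mathbf{u} \in [-1,1]^n \setminus (\mathcal{U}^\tau_y \cup \mathcal{U}^\tau_{n+1})$

We have $\mathrm{pred}^{\mathrm{OVA}}_\tau(\mathbf{u}) = y' \neq y$. Also $p_{y'} \le \frac{1}{2}$, $u_{y'} > \tau$ and $u_{y'} \ge u_y$.

$$
\begin{aligned}
p^\top \psi^{\mathrm{OVA}}(\mathbf{u})
&\overset{(19)}{\ge} \sum_{i=1}^{n} (1 - 2p_i)\, u_i + n \\
&\ge \Big( \sum_{i \in [n] \setminus \{y'\}} (1 - 2p_i)\, u_i + (1 - 2p_{y'})(\tau) \Big) + n \\
&\ge \Big( \sum_{i \in [n]} (2p_i - 1) \Big) + (1 - 2p_{y'})(\tau + 1) + n
\end{aligned}
$$

And hence we have,

$$
\begin{aligned}
p^\top \psi^{\mathrm{OVA}}(\mathbf{u}) - p^\top \psi^{\mathrm{OVA}}(2 \cdot \mathbf{e}_y - \mathbf{1})
&\ge 2 + (1 - 2p_{y'})(\tau + 1) - 4 + 4p_y \\
&= (1 - 2p_{y'})(\tau + 1) + 2(2p_y - 1) \\
&\ge (1 - 2p_{y'})(\tau + 1) + (1 + \tau)(2p_y - 1) \\
&= 2(1 + \tau)(p_y - p_{y'})
\end{aligned}
\tag{24}
$$

We also have that

$$
p^\top \ell_{\mathrm{pred}^{\mathrm{OVA}}_\tau(\mathbf{u})} - \min_t p^\top \ell_t = p^\top \ell_{y'} - p^\top \ell_y = p_y - p_{y'}
\tag{25}
$$

From Equations (24) and (25) we have for all $\mathbf{u} \in [-1,1]^n \setminus (\mathcal{U}^\tau_y \cup \mathcal{U}^\tau_{n+1})$

$$
p^\top \psi^{\mathrm{OVA}}(\mathbf{u}) - \inf_{\mathbf{u}'} p^\top \psi^{\mathrm{OVA}}(\mathbf{u}')
\ge 2(1 + \tau) \big( p^\top \ell_{\mathrm{pred}^{\mathrm{OVA}}_\tau(\mathbf{u})} - \min_t p^\top \ell_t \big)
\tag{26}
$$

Case 2: $p_{y'} < \frac{1}{2}$ for all $y' \in [n]$

We have that $n + 1 \in \operatorname{argmin}_t p^\top \ell_t$.

Case 2a: $\mathbf{u} \in \mathcal{U}^\tau_{n+1}$

The RHS of equation (20) is zero, and hence the inequality becomes trivial.

Case 2b: $\mathbf{u} \in [-1,1]^n \setminus \mathcal{U}^\tau_{n+1}$

Let $\mathrm{pred}^{\mathrm{OVA}}_\tau(\mathbf{u}) = \operatorname{argmax}_i u_i = y$. We have that $u_y > \tau$ and $p_y < \frac{1}{2}$.

$$
\begin{aligned}
p^\top \psi^{\mathrm{OVA}}(\mathbf{u}) - \inf_{\mathbf{u}'} p^\top \psi^{\mathrm{OVA}}(\mathbf{u}')
&\overset{(18)}{\ge} \big( p^\top \psi^{\mathrm{OVA}}(\mathbf{u}) \big) - 2 \\
&\overset{(19)}{\ge} \Big( \sum_{i=1}^{n} (1 - 2p_i)\, u_i + n \Big) - 2 \\
&\ge \Big( \sum_{i \in [n] \setminus \{y\}} (1 - 2p_i)\, u_i + (1 - 2p_y)(\tau) + n \Big) - 2 \\
&\ge \Big( \sum_{i \in [n]} (2p_i - 1) + (1 - 2p_y)(\tau + 1) + n \Big) - 2 \\
&= (1 - 2p_y)(\tau + 1)
\end{aligned}
\tag{27}
$$

We also have that

$$
p^\top \ell_{\mathrm{pred}^{\mathrm{OVA}}_\tau(\mathbf{u})} - \min_t p^\top \ell_t = p^\top \ell_y - p^\top \ell_{n+1} = \tfrac{1}{2} - p_y
\tag{28}
$$

From Equations (27) and (28) we have for all $\mathbf{u} \in [-1,1]^n \setminus \mathcal{U}^\tau_{n+1}$

$$
p^\top \psi^{\mathrm{OVA}}(\mathbf{u}) - \inf_{\mathbf{u}'} p^\top \psi^{\mathrm{OVA}}(\mathbf{u}')
\ge 2(1 + \tau) \big( p^\top \ell_{\mathrm{pred}^{\mathrm{OVA}}_\tau(\mathbf{u})} - \min_t p^\top \ell_t \big)
\tag{29}
$$

Equation (20), and hence the Theorem, follows from Equations (23), (26) and (29). □
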


C. Proof of Excess Risk Bounds for the BEP 
Surrogate 


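The following lemma gives some crucial, but straightforward to prove, (in)equalities satisfied by the BEP surrogate.

Lemma 6. $\forall y, y' \in [n]$, $p \in \Delta_n$, $\mathbf{u} \in \mathbb{R}^d$, $y' \neq B^{-1}(\operatorname{sign}(-\mathbf{u}))$:

$$
p^\top \psi^{\mathrm{BEP}}(-B(y)) = 2(1 - p_y)
\tag{30}
$$

$$
p^\top \psi^{\mathrm{BEP}}(\mathbf{0}) = 1
\tag{31}
$$

$$
\psi^{\mathrm{BEP}}\big(B^{-1}(\operatorname{sign}(-\mathbf{u})), \mathbf{u}\big) \ge -\min_j |u_j| + 1
\tag{32}
$$

$$
\psi^{\mathrm{BEP}}(y', \mathbf{u}) \ge \min_j |u_j| + 1
\tag{33}
$$

Theorem. Let $n \in \mathbb{N}$ and $\tau \in (0,1)$. Let $n = 2^d$. Then for all $f : \mathcal{X} \to \mathbb{R}^d$,

$$
\mathrm{er}^{\ell}_D\big[\mathrm{pred}^{\mathrm{BEP}}_\tau \circ f\big] - \mathrm{er}^{\ell,*}_D
\le \frac{1}{2 \min(\tau, 1 - \tau)} \Big( \mathrm{er}^{\psi^{\mathrm{BEP}}}_D[f] - \mathrm{er}^{\psi^{\mathrm{BEP}},*}_D \Big)
$$

Proof. We will show that $\forall p \in \Delta_n$ and all $\mathbf{u} \in \mathbb{R}^d$,

$$
p^\top \psi^{\mathrm{BEP}}(\mathbf{u}) - \inf_{\mathbf{u}'} p^\top \psi^{\mathrm{BEP}}(\mathbf{u}')
\ge 2 \min(\tau, 1 - \tau) \big( p^\top \ell_{\mathrm{pred}^{\mathrm{BEP}}_\tau(\mathbf{u})} - \min_t p^\top \ell_t \big)
\tag{34}
$$

The theorem follows by linearity of expectation.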

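Case 1: $p_y \ge \frac{1}{2}$ for some $y \in [n]$

We have that $y \in \operatorname{argmin}_t p^\top \ell_t$.

Case 1a: $\mathbf{u} \in \mathcal{U}^\tau_y$ (or $\mathrm{pred}^{\mathrm{BEP}}_\tau(\mathbf{u}) = y$)

The RHS of equation (34) is zero, and hence the inequality becomes trivial.

Case 1b: $\mathbf{u} \in \mathcal{U}^\tau_{n+1}$ (or $\mathrm{pred}^{\mathrm{BEP}}_\tau(\mathbf{u}) = n + 1$)

Let $y' = B^{-1}(\operatorname{sign}(-\mathbf{u}))$. We have $\min_j |u_j| \le \tau$.

$$
\begin{aligned}
p^\top \psi^{\mathrm{BEP}}(\mathbf{u})
&= p_{y'}\, \psi^{\mathrm{BEP}}(y', \mathbf{u}) + \sum_{i \in [n] \setminus \{y'\}} p_i\, \psi^{\mathrm{BEP}}(i, \mathbf{u}) \\
&\overset{(32),(33)}{\ge} p_{y'} \big( -\min_j |u_j| \big) + (1 - p_{y'}) \big( \min_j |u_j| \big) + 1 \\
&\ge (2p_y - 1)(-\tau) + 1 .
\end{aligned}
$$

The last inequality in the above follows from the observation that if $y' \neq y$, then $p_{y'} \le 1 - p_y \le \frac{1}{2}$. We thus have

$$
p^\top \psi^{\mathrm{BEP}}(\mathbf{u}) - p^\top \psi^{\mathrm{BEP}}(-B(y)) \ge (2p_y - 1)(1 - \tau) .
\tag{35}
$$

We also have that

$$
p^\top \ell_{\mathrm{pred}^{\mathrm{BEP}}_\tau(\mathbf{u})} - \min_t p^\top \ell_t = p^\top \ell_{n+1} - p^\top \ell_y = p_y - \tfrac{1}{2}
\tag{36}
$$

From Equations (35) and (36) we have that

$$
p^\top \psi^{\mathrm{BEP}}(\mathbf{u}) - \inf_{\mathbf{u}'} p^\top \psi^{\mathrm{BEP}}(\mathbf{u}')
\ge 2(1 - \tau) \big( p^\top \ell_{\mathrm{pred}^{\mathrm{BEP}}_\tau(\mathbf{u})} - \min_t p^\top \ell_t \big)
\tag{37}
$$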


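Case 1c: $\mathbf{u} \in \mathbb{R}^d \setminus (\mathcal{U}^\tau_y \cup \mathcal{U}^\tau_{n+1})$

Let $B^{-1}(\operatorname{sign}(-\mathbf{u})) = \mathrm{pred}^{\mathrm{BEP}}_\tau(\mathbf{u}) = y'$ for some $y' \neq y$. We have $p_{y'} \le 1 - p_y \le \frac{1}{2}$ and $\min_j |u_j| > \tau$, and

$$
\begin{aligned}
p^\top \psi^{\mathrm{BEP}}(\mathbf{u})
&\overset{(32),(33)}{\ge} p_{y'} \big( -\min_j |u_j| \big) + (1 - p_{y'}) \big( \min_j |u_j| \big) + 1 \\
&\ge \tau (1 - 2p_{y'}) + 1 \qquad \text{(From Case 1c)}
\end{aligned}
$$

Hence we get

$$
p^\top \psi^{\mathrm{BEP}}(\mathbf{u}) - p^\top \psi^{\mathrm{BEP}}(-B(y)) \overset{(30)}{\ge} 2\tau (p_y - p_{y'})
\tag{38}
$$

We also have that

$$
p^\top \ell_{\mathrm{pred}^{\mathrm{BEP}}_\tau(\mathbf{u})} - \min_t p^\top \ell_t = p^\top \ell_{y'} - p^\top \ell_y = p_y - p_{y'}
\tag{39}
$$

From Equations (38) and (39) we have that

$$
p^\top \psi^{\mathrm{BEP}}(\mathbf{u}) - \inf_{\mathbf{u}'} p^\top \psi^{\mathrm{BEP}}(\mathbf{u}')
\ge 2\tau \big( p^\top \ell_{\mathrm{pred}^{\mathrm{BEP}}_\tau(\mathbf{u})} - \min_t p^\top \ell_t \big)
\tag{40}
$$

Case 2: $p_y < \frac{1}{2}$ for all $y \in [n]$

We have that $n + 1 \in \operatorname{argmin}_t p^\top \ell_t$.

Case 2a: $\mathbf{u} \in \mathcal{U}^\tau_{n+1}$

The RHS of equation (34) is zero, and hence the inequality becomes trivial.

Case 2b: $\mathbf{u} \in \mathbb{R}^d \setminus \mathcal{U}^\tau_{n+1}$

Let $B^{-1}(\operatorname{sign}(-\mathbf{u})) = y' = \mathrm{pred}^{\mathrm{BEP}}_\tau(\mathbf{u})$ for some $y' \in [n]$. We have $p_{y'} < \frac{1}{2}$ and $\min_j |u_j| > \tau$.

$$
\begin{aligned}
p^\top \psi^{\mathrm{BEP}}(\mathbf{u}) - \inf_{\mathbf{u}'} p^\top \psi^{\mathrm{BEP}}(\mathbf{u}')
&\overset{(31)}{\ge} p_{y'}\, \psi^{\mathrm{BEP}}(y', \mathbf{u}) + \sum_{i \in [n] \setminus \{y'\}} p_i\, \psi^{\mathrm{BEP}}(i, \mathbf{u}) - 1 \\
&\overset{(32),(33)}{\ge} -p_{y'} \min_j |u_j| + (1 - p_{y'}) \min_j |u_j| \\
&\ge (1 - 2p_{y'})\, \tau
\end{aligned}
\tag{41}
$$

We also have that

$$
p^\top \ell_{\mathrm{pred}^{\mathrm{BEP}}_\tau(\mathbf{u})} - \min_t p^\top \ell_t = p^\top \ell_{y'} - p^\top \ell_{n+1} = \tfrac{1}{2} - p_{y'}
\tag{42}
$$

From Equations (41) and (42) we have that

$$
p^\top \psi^{\mathrm{BEP}}(\mathbf{u}) - \inf_{\mathbf{u}'} p^\top \psi^{\mathrm{BEP}}(\mathbf{u}')
\ge 2\tau \big( p^\top \ell_{\mathrm{pred}^{\mathrm{BEP}}_\tau(\mathbf{u})} - \min_t p^\top \ell_t \big)
\tag{43}
$$

Equation (34), and hence the Theorem, follows from equations (37), (40) and (43). □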
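The BEP lemma and the calibration inequality (34) can be checked numerically in the same spirit. This is a sketch under assumptions that this appendix does not restate: the surrogate is taken to be $\psi^{\mathrm{BEP}}(y, \mathbf{u}) = (\max_j B_j(y) u_j + 1)_+$, the predictor abstains when $\min_j |u_j| \le \tau$ and otherwise returns $B^{-1}(\operatorname{sign}(-\mathbf{u}))$, and the bijection $B$ is a hypothetical binary-digit encoding chosen only for illustration. The infimum is replaced by the comparators $\mathbf{0}$ and $-B(y)$ from (30) and (31). Classes are 0-indexed and abstain is index `N`.

```python
import random

D = 2        # code length d (kept small for illustration)
N = 2 ** D   # number of classes, n = 2^d

def code(y):
    """Hypothetical bijection B: class y -> {-1,+1}^d via the binary digits of y."""
    return [1 if (y >> j) & 1 else -1 for j in range(D)]

def decode(signs):
    """Inverse map B^{-1}: sign vector in {-1,+1}^d -> class index."""
    return sum(1 << j for j, s in enumerate(signs) if s > 0)

def psi_bep(y, u):
    """Assumed BEP surrogate: (max_j B_j(y) u_j + 1)_+."""
    return max(0.0, max(b * uj for b, uj in zip(code(y), u)) + 1.0)

def decoded_class(u):
    """B^{-1}(sign(-u)); assumes no coordinate of u is exactly zero."""
    return decode([1 if -uj > 0 else -1 for uj in u])

def pred_bep(u, tau):
    """Abstain (index N) if min_j |u_j| <= tau, else return B^{-1}(sign(-u))."""
    return N if min(abs(uj) for uj in u) <= tau else decoded_class(u)

def expected_psi(p, u):
    """Conditional surrogate risk p^T psi(u)."""
    return sum(p[y] * psi_bep(y, u) for y in range(N))

def lemma6_holds(p, u):
    """Check (30)-(33) of Lemma 6 at one (p, u) pair."""
    m = min(abs(uj) for uj in u)
    y_hat = decoded_class(u)
    ok = abs(expected_psi(p, [0.0] * D) - 1.0) < 1e-9                 # (31)
    for y in range(N):
        neg_code = [-b for b in code(y)]
        ok = ok and abs(expected_psi(p, neg_code) - 2.0 * (1.0 - p[y])) < 1e-9  # (30)
        bound = 1.0 - m if y == y_hat else 1.0 + m                    # (32) / (33)
        ok = ok and psi_bep(y, u) >= bound - 1e-9
    return ok

def slack_34(p, u, tau):
    """LHS minus RHS of (34), with inf replaced by the comparators 0 and -B(y)."""
    comparator = min(1.0, min(2.0 * (1.0 - py) for py in p))
    target = [1.0 - py for py in p] + [0.5]   # last entry is the abstain option
    excess = target[pred_bep(u, tau)] - min(target)
    return (expected_psi(p, u) - comparator) - 2.0 * min(tau, 1.0 - tau) * excess

def random_simplex_point(rng):
    """Draw a probability vector over N classes by normalising positive weights."""
    w = [rng.random() + 1e-12 for _ in range(N)]
    total = sum(w)
    return [x / total for x in w]
```

Because the comparators only upper-bound the infimum, a non-negative `slack_34` at a sampled point implies (34) there; the choice of `code` does not matter for the argument as long as it is a bijection onto $\{-1,+1\}^d$.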






